
Fixes test_dns_resolv_conf.py failures due to stale DNS config in some of the containers #26188

Open
purush-nexthop wants to merge 1 commit into sonic-net:master from nexthop-ai:purush-fix-stale-dns-config

@purush-nexthop purush-nexthop commented Mar 14, 2026

Why I did it

test_dns_resolv_conf.py removes the DNS config and checks whether /etc/resolv.conf is updated in all containers. The test fails because some of the containers, such as pmon and restapi, still have stale DNS config in their /etc/resolv.conf.

Work item tracking
  • Microsoft ADO (number only):

How I did it

When the DNS config is removed, the following happens:

1. All containers are restarted
   config load_minigraph is called with an empty nameserver list, which results in all containers getting restarted.
2. Host /etc/resolv.conf is updated
   (2a) hostcfgd (sonic-host-services/scripts/hostcfgd) detects the change and restarts the resolv-config service.
   (2b) /usr/bin/resolv-config.sh updates /etc/resolv.conf.
3. Networking service restarts
   (3a) config load_minigraph restarts the sonic target, which triggers interfaces-config.service to call interfaces-config.sh.
   (3b) interfaces-config.sh calls "systemctl restart networking" to restart the networking service.
4. Per-container /etc/resolv.conf update
   When a container comes up, update-containers is called with the container name as its argument. Please see files/build_templates/docker_image_ctl.j2. When called with a container name, update-containers updates /etc/resolv.conf in that container from the contents of the host /etc/resolv.conf.
   (4a) Containers that come up after the host /etc/resolv.conf is updated have no issues.
   (4b) Containers that came up before the host /etc/resolv.conf update will have stale config in them.
5. Bulk update to containers
   Any change to /etc/resolv.conf results in running all the scripts under the /etc/resolvconf/update-libc.d/ path. As a result, update-containers should get called without a container name (to update all containers).
   During a bulk update, update-containers goes through all active containers and calls update_container_resolv for each one, which updates /etc/resolv.conf inside the container based on the values read from the host /etc/resolv.conf.
   update-containers exits early if the networking service is not up. This check came in as part of 3a143ad (Optimization to skip updates during warm reboot).
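For readers who have not looked at the script, the per-container and bulk paths described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual update-containers source: the HOST_RESOLV_CONF variable, the function bodies, and the way the file is written into the container are all assumptions. Only the update_container_resolv name and the `systemctl is-active networking.service` check come from this PR.

```shell
#!/bin/bash
# Hypothetical sketch of the flow described above; NOT the real update-containers script.
# HOST_RESOLV_CONF and both function bodies are assumptions made for illustration.

HOST_RESOLV_CONF="${HOST_RESOLV_CONF:-/etc/resolv.conf}"

# Overwrite /etc/resolv.conf inside one container with the host copy.
update_container_resolv() {
    local container="$1"
    docker exec -i "$container" sh -c 'cat > /etc/resolv.conf' < "$HOST_RESOLV_CONF"
}

# With a container name: update only that container (per-container path).
# Without one: bulk update of every running container, guarded by the
# networking check that this PR proposes to remove.
update_containers() {
    local name="$1"
    local container

    if [ -n "$name" ]; then
        update_container_resolv "$name"
        return
    fi

    # Early exit: containers that started with stale config stay stale.
    if [ "$(systemctl is-active networking.service 2>/dev/null)" != "active" ]; then
        return 0
    fi

    for container in $(docker ps --format '{{.Names}}'); do
        update_container_resolv "$container"
    done
}
```

In this sketch, a bulk call made while networking.service is restarting returns before touching any container, which is the window the PR description is concerned with.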

Sequence that lands in the problem state:

1. All containers are restarted along with the networking service.
2. pmon/restapi come up early with stale DNS config. All other containers come up after the update to the host /etc/resolv.conf.
3. update-containers is called for a bulk update. It exits early because the networking service is down.
4. The networking service comes back up.
5. pmon/restapi continue to have stale config.

The fix here removes the networking check in update-containers.

How to verify it

Ran the test multiple times on different products and ensured that the test passes consistently.
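Apart from the test itself, one way to spot-check a device by hand is to compare each running container's /etc/resolv.conf against the host copy. The helper below is a hypothetical sketch, not part of this PR or of test_dns_resolv_conf.py. The function name is made up, and it assumes that containers are expected to hold a byte-for-byte copy of the host file.

```shell
#!/bin/bash
# Hypothetical helper, not part of the PR or the test suite.
# Prints the name of every running container whose /etc/resolv.conf
# differs from the host copy (first argument, default /etc/resolv.conf).
find_stale_containers() {
    local host_conf="${1:-/etc/resolv.conf}"
    local container

    for container in $(docker ps --format '{{.Names}}'); do
        # cmp -s reads the container's file from stdin ("-") and stays silent;
        # a non-zero status means the contents differ or could not be read.
        if ! docker exec "$container" cat /etc/resolv.conf 2>/dev/null \
                | cmp -s - "$host_conf"; then
            echo "$container"
        fi
    done
}
```

Running `find_stale_containers` after removing the DNS config and reloading should print nothing if every container picked up the host file. Any name it prints is a container that still holds different content.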

Which release branch to backport (provide reason below if selected)

  • 202305
  • 202311
  • 202405
  • 202411
  • 202505
  • 202511

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)


linux-foundation-easycla bot commented Mar 14, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: purush-nexthop / name: purush-nexthop (58f625c)

@mssonicbld
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines

Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.

@purush-nexthop purush-nexthop force-pushed the purush-fix-stale-dns-config branch from e7bb017 to 846257c Compare March 14, 2026 20:45
@mssonicbld
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@purush-nexthop purush-nexthop marked this pull request as ready for review March 16, 2026 17:02
@purush-nexthop purush-nexthop requested a review from lguohan as a code owner March 16, 2026 17:02
…alls to update-containers can update containers with stale dns config

Signed-off-by: purush-nexthop <[email protected]>
@purush-nexthop purush-nexthop force-pushed the purush-fix-stale-dns-config branch from 846257c to 58f625c Compare March 18, 2026 05:00
@mssonicbld
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@judyjoseph
Contributor

@oleksandrivantsiv @mlok-nokia please can you review

fi

# Check if networking service is active (only for bulk updates)
networking_status=$(systemctl is-active networking.service 2>/dev/null)
Collaborator

This change will affect the performance of the config reload command, as it will run an additional bulk update for all containers after the networking service is stopped.

Is the issue you are trying to fix a recent degradation?

Collaborator

Please review the following PRs. They fix similar issues:
#25991
sonic-net/sonic-utilities#4365

Author

Thanks for reviewing this.
No, this change is not a fix for a recent degradation. During config reload (after removing the DNS config), some of the containers come up with stale config and never get updated, because update-containers was bailing out on the networking check. Let me take a look at the PRs you posted and see if they could help here.

@vrajeshe

@Aravind-Subbaroyan we are seeing a similar issue in 202511 runs.

@arlakshm arlakshm requested a review from qiluo-msft March 25, 2026 19:02
6 participants